library(GENIE3)
library(doParallel)
library(igraph)
library(tidyverse)
library(DT)
library(reticulate)
count_matrix <- readRDS("./../data/simatx.RDS")
adjm <- read.table("./../data/adjacency_matrix.csv", header = TRUE, row.names = 1, sep = ",")
marker <- read.table("./../data/Tcell.marker.csv", header = TRUE, sep = ",")
count_matrices <- list()
for (i in 1:5) {
  count_matrix_i <- as.data.frame(count_matrix[[i]])
  colnames(count_matrix_i) <- colnames(adjm)
  rownames(count_matrix_i) <- paste0("cell", 1:nrow(count_matrix_i))
  count_matrices[[i]] <- count_matrix_i
}
count_matrices[[1]] %>%
  datatable(extensions = 'Buttons',
            options = list(
              dom = 'Bfrtip',
              buttons = c('csv', 'excel'),
              scrollX = TRUE,
              pageLength = 10),
            caption = "Simulated count matrix")
Inferring Gene Regulatory Networks (GRNs) from gene expression data is a challenging task, typically tackled using machine learning algorithms. Both GENIE3 and GRNBoost are widely used methods for GRN inference, based on ensemble learning models. GENIE3 employs random forest regression, while GRNBoost uses gradient boosting—each offering unique strengths for this problem.
GENIE3 was the top-performing method in the DREAM5 challenge for GRN inference. Random forests are an ensemble method that constructs a large number of decision trees during training and outputs, for regression, the mean prediction of the individual trees.
For GRN inference, the random forest algorithm in GENIE3 is used as follows:

1. For each target gene \(g\), a random forest is trained in which the expression of \(g\) is predicted from the expression levels of all potential transcription factors (TFs) in the dataset.
2. Each tree in the random forest is grown on a bootstrap sample of the training data, and each split considers a random subset of the available features (transcription factors).
3. The importance of each transcription factor for predicting the target gene is measured by aggregating the feature importance scores across all trees.
The final output is a ranked list of transcription factors for each target gene, where the importance score reflects the strength of the regulatory relationship.
The random forest model can be described as: \[ f(x) = \frac{1}{T} \sum_{t=1}^{T} h_t(x) \] Where \(T\) is the number of trees, and \(h_t(x)\) is the prediction from the \(t\)-th tree. The importance of a transcription factor \(TF_i\) for predicting \(g\) is calculated based on the decrease in impurity (e.g., Gini impurity or variance reduction) across all splits where \(TF_i\) is used.
This approach is advantageous because:

- Non-linear relationships: random forests can model complex, non-linear interactions between transcription factors and target genes.
- Robustness to noise: by averaging across many trees, random forests reduce the likelihood of overfitting to noise in the data.
Let \(X = \{x_1, x_2, ..., x_p\}\) be the matrix of transcription factor expressions and \(y_g\) be the expression of the target gene \(g\). For each \(g\), the random forest minimizes the mean squared error (MSE): \[ MSE = \frac{1}{n} \sum_{i=1}^{n} \left( y_g^{(i)} - f(X^{(i)}) \right)^2 \] Where \(n\) is the number of samples, and \(f(X^{(i)})\) is the predicted expression for sample \(i\).
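The per-gene regression step can be sketched with the `randomForest` package (an assumption for illustration; GENIE3 ships its own optimized implementation). A toy matrix of TF expressions predicts one target gene, and the permutation importance (`%IncMSE`) ranks candidate regulators:

```r
library(randomForest)

set.seed(1)
# Toy data: 3 candidate TFs measured in 100 cells; the target depends only on TF1
tf_expr <- matrix(rnorm(300), nrow = 100,
                  dimnames = list(NULL, c("TF1", "TF2", "TF3")))
target_g <- 2 * tf_expr[, "TF1"] + rnorm(100, sd = 0.1)

# One random forest per target gene, as in GENIE3
rf <- randomForest(x = tf_expr, y = target_g, ntree = 500, importance = TRUE)

# Permutation importance (%IncMSE) serves as the regulatory score
imp <- importance(rf, type = 1)
imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]
```

Here TF1 should dominate the ranking, mirroring how GENIE3 would rank it as the strongest candidate regulator of the target gene.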
In single-cell RNA sequencing (scRNA-seq) data, zero-inflation is a common issue, where many genes have expression levels recorded as zero across a large number of cells. This sparsity can challenge many traditional statistical models. However, GENIE3 is based on tree-based methods, specifically Random Forests, which are inherently robust to sparse data.
While GENIE3 does not explicitly model zero-inflation, Random Forests naturally handle zero-inflated data due to their ability to partition data based on splits at specific thresholds, thus segregating zeros from non-zero values without overfitting to the zeros. This makes GENIE3 well-suited to noisy and sparse data like scRNA-seq.
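A single regression tree makes the point concrete: a split threshold just above zero isolates the dropout cells in one branch. This minimal sketch uses `rpart` (an assumption for illustration; GENIE3's trees behave analogously):

```r
library(rpart)

set.seed(1)
# Zero-inflated regulator: ~60% dropouts, otherwise positive expression
tf <- ifelse(runif(200) < 0.6, 0, rexp(200, rate = 0.5))
# Target responds only when the regulator is actually expressed
y <- ifelse(tf > 0, 1.5 * tf, 0) + rnorm(200, sd = 0.1)

fit <- rpart(y ~ tf, control = rpart.control(maxdepth = 2))
# The split thresholds partition zeros away from expressed cells
fit$splits
```

No explicit zero-inflation model is needed: the tree simply learns where to place its thresholds.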
GENIE3 therefore constructs one random forest per target gene, predicting its expression from the expression of all other genes, and reads the inferred regulatory links off the aggregated feature importances. This makes it a powerful tool for discovering potential regulatory relationships, especially in high-dimensional, sparse datasets like scRNA-seq.
set.seed(123)
regulatory_network_genie3 <- GENIE3(t(count_matrices[[1]]))
# Extract link list (gene regulatory interactions) from GENIE3 results
link_list_genie3 <- getLinkList(regulatory_network_genie3)
link_list_genie3 %>%
  datatable(extensions = 'Buttons',
            options = list(
              dom = 'Bfrtip',
              buttons = c('csv', 'excel'),
              scrollX = TRUE,
              pageLength = 10),
            caption = "GENIE3 output")
write.csv(link_list_genie3, "genie3_network.csv", row.names = FALSE)
gene_names <- unique(c(link_list_genie3$regulatoryGene, link_list_genie3$targetGene))
adj_matrix_genie3 <- matrix(0, nrow = length(gene_names), ncol = length(gene_names))
rownames(adj_matrix_genie3) <- colnames(adj_matrix_genie3) <- gene_names
# Fill the adjacency matrix from the GENIE3 link list, keeping only links with weight >= 0.1
for (i in 1:nrow(link_list_genie3)) {
  # getLinkList() names its columns regulatoryGene / targetGene / weight
  regulator <- as.character(link_list_genie3$regulatoryGene[i])
  target <- as.character(link_list_genie3$targetGene[i])
  weight <- link_list_genie3$weight[i]
  if (weight >= 0.1) {
    adj_matrix_genie3[regulator, target] <- 1
  }
}
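Since `GENIE3()` already returns the full weight matrix (regulators in rows, targets in columns), the same binarization can be written as a single vectorized thresholding step instead of a loop; a self-contained sketch on a toy weight matrix:

```r
# Toy weight matrix in GENIE3's layout (regulators in rows, targets in columns)
w <- matrix(c(0.05, 0.30, 0.12, 0.01), nrow = 2,
            dimnames = list(c("g1", "g2"), c("g1", "g2")))

# Binarize at the same 0.1 threshold used in the loop above
adj <- 1 * (w >= 0.1)
adj
```

Up to row/column ordering, applying this to the GENIE3 weight matrix reproduces the loop's result.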
adj_matrix_genie3 %>%
  datatable(extensions = 'Buttons',
            options = list(
              dom = 'Bfrtip',
              buttons = c('csv', 'excel'),
              scrollX = TRUE,
              pageLength = 10),
            caption = "GENIE3 adjacency matrix (weight threshold 0.1)")
write.csv(adj_matrix_genie3, "genie3_adjacency_matrix.csv")
# Create igraph objects for both networks
graph_genie3 <- graph_from_adjacency_matrix(adj_matrix_genie3, mode = "undirected")
graph_provided <- graph_from_adjacency_matrix(as.matrix(adjm), mode = "undirected")
par(mfrow = c(1, 2))
# Plot GENIE3 Network
plot(graph_genie3, main = "GENIE3 Inferred Network", vertex.label.color = "black",
vertex.size = 10, edge.arrow.size = 0.5, vertex.label.cex = 0.7)
# Plot Provided Network
plot(graph_provided, main = "Provided Network", vertex.label.color = "black",
vertex.size = 10, edge.arrow.size = 0.5, vertex.label.cex = 0.7)
par(mfrow = c(1, 1))
GRNBoost, an alternative to GENIE3, employs gradient boosting machines (GBM), a method that iteratively builds a sequence of shallow decision trees, where each new tree corrects the errors made by the previous ensemble. It is based on boosting, where weak learners (typically decision stumps or shallow trees) are sequentially added, and the model is updated by minimizing a loss function.
The steps in GRNBoost are:

1. As in GENIE3, a regression model is built for each gene \(g\) to predict its expression using the expression levels of candidate transcription factors.
2. Instead of a random forest, GRNBoost uses gradient boosting, in which each new tree is added to reduce the residual errors of the previous trees.
3. The importance of each transcription factor is computed from how much the trees that include it as a predictor reduce the loss function.
Gradient boosting can be mathematically formulated as: \[ f(x) = \sum_{m=1}^{M} \lambda h_m(x) \] Where \(h_m(x)\) is a decision tree at step \(m\), \(\lambda\) is the learning rate, and \(M\) is the total number of trees. The loss function is typically the squared error for regression tasks: \[ L(y, f(x)) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - f(x_i) \right)^2 \] The model is updated by adding a new tree that reduces the gradient of this loss function: \[ \hat{f}_m(x) = \hat{f}_{m-1}(x) + \lambda h_m(x) \]
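The update \(\hat{f}_m(x) = \hat{f}_{m-1}(x) + \lambda h_m(x)\) can be traced with a hand-rolled boosting loop over shallow `rpart` trees. This is a minimal sketch of the mechanism, not GRNBoost2's actual implementation:

```r
library(rpart)

set.seed(1)
n <- 200
x <- data.frame(tf = rnorm(n))
y <- sin(2 * x$tf) + rnorm(n, sd = 0.2)

lambda <- 0.1             # learning rate
M <- 100                  # number of boosting rounds
f_hat <- rep(mean(y), n)  # initial constant model

for (m in 1:M) {
  resid <- y - f_hat  # negative gradient of the squared-error loss
  # Fit a depth-1 tree (stump) to the current residuals
  stump <- rpart(r ~ tf, data = cbind(x, r = resid),
                 control = rpart.control(maxdepth = 1))
  # f_m = f_{m-1} + lambda * h_m
  f_hat <- f_hat + lambda * predict(stump, x)
}

# Training MSE should drop well below the variance of y
mean((y - f_hat)^2)
```

Each round fits a weak learner to what the current ensemble still gets wrong, which is exactly the sequential error-correction the equations above describe.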
The main advantages of gradient boosting in GRNBoost include:

- Efficiency: by iteratively refining the model with shallow trees, gradient boosting can achieve high accuracy without requiring deep trees as in random forests.
- Early stopping: GRNBoost2 implements an early-stopping mechanism based on out-of-bag improvement estimates, which helps avoid overfitting and reduces computation time.
For a target gene \(g\), the expression is modeled as: \[ f(X) = \sum_{m=1}^{M} \lambda h_m(X) \] Where each \(h_m(X)\) is a shallow tree built to minimize the loss \(L(y_g, f(X))\), where \(f(X)\) represents the predicted expression values for gene \(g\).
In single-cell RNA sequencing (scRNA-seq) data, the high occurrence of zero expression values, or zero-inflation, is a significant challenge for many computational models. Zero-inflation occurs when a large number of genes are either not expressed or their expression levels are too low to detect in many cells, leading to sparse datasets.
GRNBoost2 is built on Gradient Boosting Machines (GBMs), which are tree-based models. Like Random Forests, GBMs can handle sparsity well because tree-based methods make decisions by splitting data based on feature values. This allows them to naturally partition and differentiate between zero and non-zero values without requiring explicit modeling of zero-inflation. The ability of GRNBoost2 to efficiently handle sparse data makes it a strong tool for gene regulatory network (GRN) inference in scRNA-seq datasets.
GRNBoost2 implements the Gradient Boosting Machine (GBM) algorithm: models are built sequentially, with each new model correcting the errors of the previous ones. For each target gene, a GBM predicts its expression from the expression of the other genes, and the resulting feature importances define the inferred regulatory links. This sequential refinement makes GRNBoost2 well suited to complex, high-dimensional gene expression datasets.
#sudo apt-get install python3-venv
use_python("/usr/bin/python3", required = TRUE)
py_config()
## python: /usr/bin/python3
## libpython: /usr/lib/python3.8/config-3.8-x86_64-linux-gnu/libpython3.8.so
## pythonhome: //usr://usr
## version: 3.8.10 (default, Sep 11 2024, 16:02:53) [GCC 9.4.0]
## numpy: /home/francescoc/.local/lib/python3.8/site-packages/numpy
## numpy_version: 1.22.4
##
## NOTE: Python version was forced by RETICULATE_PYTHON
arboreto <- import("arboreto.algo")
pandas <- import("pandas")
numpy <- import("numpy")
count_matrix_df <- as.data.frame(count_matrices[[1]])
genes <- colnames(count_matrix_df)
df_pandas <- pandas$DataFrame(data = count_matrix_df, columns = genes, index = rownames(count_matrix_df))
grn_links <- arboreto$grnboost2(df_pandas, gene_names = genes)
grn_links %>%
  datatable(extensions = 'Buttons',
            options = list(
              dom = 'Bfrtip',
              buttons = c('csv', 'excel'),
              scrollX = TRUE,
              pageLength = 10),
            caption = "GRNBoost2 links")
unique_genes <- unique(c(grn_links$TF, grn_links$target))  # Unique genes from the GRNBoost2 links
adj_matrix_grnboost <- matrix(0, nrow = length(unique_genes), ncol = length(unique_genes))
rownames(adj_matrix_grnboost) <- unique_genes
colnames(adj_matrix_grnboost) <- unique_genes
for (i in 1:nrow(grn_links)) {
  tf <- grn_links$TF[i]
  target <- grn_links$target[i]
  adj_matrix_grnboost[tf, target] <- 1  # Mark the inferred edge
}
adj_matrix_original <- as.matrix(adjm)
graph_grnboost <- graph_from_adjacency_matrix(adj_matrix_grnboost, mode = "undirected")
graph_original <- graph_from_adjacency_matrix(adj_matrix_original, mode = "undirected")
# Set up side-by-side plotting
par(mfrow = c(1, 2))
# Plot GRNBoost2 Network
plot(graph_grnboost, main = "GRNBoost2 Inferred Network", vertex.label.color = "black",
vertex.size = 10, edge.arrow.size = 0.5, vertex.label.cex = 0.7)
# Plot Original Network
plot(graph_original, main = "Original Network", vertex.label.color = "black",
vertex.size = 10, edge.arrow.size = 0.5, vertex.label.cex = 0.7)
# Reset plotting layout
par(mfrow = c(1, 1))
Both GENIE3 and GRNBoost provide powerful and scalable methods for inferring GRNs from high-dimensional gene expression data. GENIE3 leverages the power of random forests to capture complex relationships, while GRNBoost uses gradient boosting for computational efficiency. These methods are widely applicable, especially in the context of large datasets from single-cell RNA sequencing experiments, enabling high-resolution understanding of gene regulatory dynamics.